Kernel PCA and De-Noising in Feature Spaces
Abstract
Kernel PCA as a nonlinear feature extractor has proven powerful as a preprocessing step for classification algorithms. But it can also be considered as a natural generalization of linear principal component analysis. This gives rise to the question of how to use nonlinear features for data compression, reconstruction, and de-noising, applications common in linear PCA. This is a nontrivial task, as the results provided by kernel PCA live in some high dimensional feature space and need not have pre-images in input space. This work presents ideas for finding approximate pre-images, focusing on Gaussian kernels, and shows experimental results using these pre-images in data reconstruction and de-noising on toy examples as well as on real world data.

1 PCA and Feature Spaces

Principal Component Analysis (PCA) (e.g. [3]) is an orthogonal basis transformation. The new basis is found by diagonalizing the centered covariance matrix of a data set $\{x_k \in \mathbb{R}^N \mid k = 1, \ldots, \ell\}$, defined by $C = \langle (x_i - \langle x_k \rangle)(x_i - \langle x_k \rangle)^\top \rangle$. The coordinates in the Eigenvector basis are called principal components. The size of an Eigenvalue corresponding to an Eigenvector $v$ of $C$ equals the amount of variance in the direction of $v$. Furthermore, the directions of the first $n$ Eigenvectors corresponding to the biggest $n$ Eigenvalues cover as much variance as possible by $n$ orthogonal directions. In many applications they contain the most interesting information: for instance, in data compression, where we project onto the directions with biggest variance to retain as much information as possible, or in de-noising, where we deliberately drop directions with small variance.

Clearly, one cannot assert that linear PCA will always detect all structure in a given data set. By the use of suitable nonlinear features, one can extract more information. Kernel PCA is very well suited to extract interesting nonlinear structures in the data [9]. The purpose of this work is therefore (i) to consider nonlinear de-noising based on Kernel PCA and (ii) to clarify the connection between feature space expansions and meaningful patterns in input space. Kernel PCA first maps the data into some feature space $F$ via a (usually nonlinear) function $\Phi$ and then performs linear PCA on the mapped data. As the feature space $F$ might be very high dimensional (e.g. when mapping into the space of all possible $d$-th order monomials of input space), kernel PCA employs Mercer kernels instead of carrying out the mapping $\Phi$ explicitly. A Mercer kernel is a function $k(x, y)$ which for all data sets $\{x_i\}$ gives rise to a positive matrix $K_{ij} = k(x_i, x_j)$ [6]. One can show that using $k$ instead of a dot product in input space corresponds to mapping the data with some $\Phi$ to a feature space $F$ [1], i.e. $k(x, y) = (\Phi(x) \cdot \Phi(y))$. Kernels that have proven useful include Gaussian kernels $k(x, y) = \exp(-\|x - y\|^2/c)$ and polynomial kernels $k(x, y) = (x \cdot y)^d$. Clearly, all algorithms that can be formulated in terms of dot products, e.g. Support Vector Machines [1], can be carried out in some feature space $F$ without mapping the data explicitly. All these algorithms construct their solutions as expansions in the potentially infinite-dimensional feature space.

The paper is organized as follows: in the next section, we briefly describe the kernel PCA algorithm. In section 3, we present an algorithm for finding approximate pre-images of expansions in feature space. Experimental results on toy and real world data are given in section 4, followed by a discussion of our findings (section 5).
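As a concrete illustration of the two kernels just mentioned (a minimal NumPy sketch of ours, not part of the paper), the Gram matrix $K_{ij} = k(x_i, x_j)$ can be computed as follows; the width $c$ and degree $d$ are free parameters supplied by the user.

import numpy as np

def gaussian_kernel_matrix(X, c):
    """Gram matrix K_ij = exp(-||x_i - x_j||^2 / c) for the rows x_i of X."""
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    sq_dists = np.maximum(sq_dists, 0.0)  # guard against small negative values from round-off
    return np.exp(-sq_dists / c)

def polynomial_kernel_matrix(X, d):
    """Gram matrix K_ij = (x_i . x_j)^d."""
    return (X @ X.T) ** d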
2 Kernel PCA and Reconstruction

To perform PCA in feature space, we need to find Eigenvalues $\lambda > 0$ and Eigenvectors $V \in F \setminus \{0\}$ satisfying $\lambda V = CV$ with $C = \langle \Phi(x_k) \Phi(x_k)^\top \rangle$.¹ Substituting $C$ into the Eigenvector equation, we note that all solutions $V$ must lie in the span of $\Phi$-images of the training data. This implies that we can consider the equivalent system

$\lambda (\Phi(x_k) \cdot V) = (\Phi(x_k) \cdot CV)$ for all $k = 1, \ldots, \ell$   (1)

and that there exist coefficients $\alpha_1, \ldots, \alpha_\ell$ such that

$V = \sum_{i=1}^{\ell} \alpha_i \Phi(x_i)$   (2)

Substituting $C$ and (2) into (1), and defining an $\ell \times \ell$ matrix $K$ by $K_{ij} := (\Phi(x_i) \cdot \Phi(x_j)) = k(x_i, x_j)$, we arrive at a problem which is cast in terms of dot products: solve

$\ell \lambda \alpha = K \alpha$   (3)

where $\alpha = (\alpha_1, \ldots, \alpha_\ell)^\top$ (for details see [7]). Normalizing the solutions $V^k$, i.e. $(V^k \cdot V^k) = 1$, translates into $\lambda_k (\alpha^k \cdot \alpha^k) = 1$. To extract nonlinear principal components for the $\Phi$-image of a test point $x$ we compute the projection onto the $k$-th component by $\beta_k := (V^k \cdot \Phi(x)) = \sum_{i=1}^{\ell} \alpha_i^k k(x, x_i)$. For feature extraction, we thus have to evaluate $\ell$ kernel functions instead of a dot product in $F$, which is expensive if $F$ is high-dimensional (or, as for Gaussian kernels, infinite-dimensional).

To reconstruct the $\Phi$-image of a vector $x$ from its projections $\beta_k$ onto the first $n$ principal components in $F$ (assuming that the Eigenvectors are ordered by decreasing Eigenvalue size), we define a projection operator $P_n$ by

$P_n \Phi(x) = \sum_{k=1}^{n} \beta_k V^k$   (4)

If $n$ is large enough to take into account all directions belonging to Eigenvectors with nonzero Eigenvalue, we have $P_n \Phi(x_i) = \Phi(x_i)$. Otherwise (kernel) PCA still satisfies (i) that the overall squared reconstruction error $\sum_i \|P_n \Phi(x_i) - \Phi(x_i)\|^2$ is minimal and (ii) the retained variance is maximal among all projections onto orthogonal directions in $F$. In common applications, however, we are interested in a reconstruction in input space rather than in $F$. The present work attempts to achieve this by computing a vector $z$ satisfying $\Phi(z) = P_n \Phi(x)$. The hope is that for the kernel used, such a $z$ will be a good approximation of $x$ in input space. However, (i) such a $z$ will not always exist and (ii) if it exists, it need not be unique.² As an example for (i), we consider a possible representation of $F$. One can show [7] that $\Phi$ can be thought of as a map $\Phi(x) = k(x, \cdot)$ into a Hilbert space $H_k$ of functions $\sum_i \alpha_i k(x_i, \cdot)$ with a dot product satisfying $(k(x, \cdot) \cdot k(y, \cdot)) = k(x, y)$. Then $H_k$ is called a reproducing kernel Hilbert space (e.g. [6]). Now, for a Gaussian kernel, $H_k$ contains all linear superpositions of Gaussian bumps on $\mathbb{R}^N$ (plus limit points), whereas by definition of $\Phi$ only single bumps $k(x, \cdot)$ have pre-images under $\Phi$.

When the vector $P_n \Phi(x)$ has no pre-image $z$ we try to approximate it by minimizing

$\rho(z) = \|\Phi(z) - P_n \Phi(x)\|^2$   (5)

This is a special case of the reduced set method [2]. Replacing terms independent of $z$ by $\Omega$, we obtain

$\rho(z) = \|\Phi(z)\|^2 - 2(\Phi(z) \cdot P_n \Phi(x)) + \Omega$   (6)

Substituting (4) and (2) into (6), we arrive at an expression which is written in terms of dot products. Consequently, we can introduce a kernel to obtain a formula for $\rho$ (and thus $\nabla_z \rho$) which does not rely on carrying out $\Phi$ explicitly:

$\rho(z) = k(z, z) - 2 \sum_{k=1}^{n} \beta_k \sum_{i=1}^{\ell} \alpha_i^k k(z, x_i) + \Omega$   (7)

¹ For simplicity, we assume that the mapped data are centered in $F$. Otherwise, we have to go through the same algebra using $\tilde{\Phi}(x) := \Phi(x) - \langle \Phi(x_i) \rangle$.
² If the kernel allows reconstruction of the dot product in input space, and under the assumption that a pre-image exists, it is possible to construct it explicitly (cf. [7]). But clearly, these conditions do not hold true in general.
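Before turning to the pre-image problem, the eigenproblem (3) and the projections $\beta_k$ can be summarized in code. The NumPy sketch below is ours, not the authors' implementation; it assumes, as in footnote 1, that the mapped data are centered in $F$ (otherwise $K$ has to be centered first) and that the retained Eigenvalues are strictly positive.

import numpy as np

def kernel_pca(K, n_components):
    """Solve l*lambda*alpha = K*alpha (eq. (3)) for a centered l x l Gram matrix K.

    Returns the top Eigenvalues lambda_k and coefficient vectors alpha^k,
    scaled so that lambda_k * (alpha^k . alpha^k) = 1, i.e. (V^k . V^k) = 1.
    """
    l = K.shape[0]
    eigvals, eigvecs = np.linalg.eigh(K)          # eigenvalues of K equal l * lambda
    order = np.argsort(eigvals)[::-1][:n_components]
    lambdas = eigvals[order] / l
    alphas = eigvecs[:, order] / np.sqrt(lambdas)  # enforce lambda_k * ||alpha^k||^2 = 1
    return lambdas, alphas

def project(k_x, alphas):
    """beta_k = sum_i alpha_i^k k(x, x_i), where k_x[i] = k(x, x_i)."""
    return k_x @ alphas

The columns of the returned array correspond to the coefficient vectors $\alpha^k$ of (2), so that project returns the nonlinear principal components $\beta_1, \ldots, \beta_n$ of a test point.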
3 Pre-Images for Gaussian Kernels

To optimize (7) we employed standard gradient descent methods. If we restrict our attention to kernels of the form $k(x, y) = k(\|x - y\|^2)$ (and thus satisfying $k(x, x) = \text{const.}$ for all $x$), an optimal $z$ can be determined as follows (cf. [8]): we deduce from (6) that we have to maximize

$\tilde{\rho}(z) = (\Phi(z) \cdot P_n \Phi(x)) + \Omega' = \sum_{i=1}^{\ell} \gamma_i k(z, x_i) + \Omega'$   (8)

where we set $\gamma_i = \sum_{k=1}^{n} \beta_k \alpha_i^k$ (for some $\Omega'$ independent of $z$). For an extremum, the gradient with respect to $z$ has to vanish: $\nabla_z \tilde{\rho}(z) = \sum_{i=1}^{\ell} \gamma_i k'(\|z - x_i\|^2)(z - x_i) = 0$. This leads to a necessary condition for the extremum: $z = \sum_i \delta_i x_i / \sum_j \delta_j$, with $\delta_i = \gamma_i k'(\|z - x_i\|^2)$. For a Gaussian kernel $k(x, y) = \exp(-\|x - y\|^2/c)$ we get

$z = \frac{\sum_{i=1}^{\ell} \gamma_i \exp(-\|z - x_i\|^2/c)\, x_i}{\sum_{i=1}^{\ell} \gamma_i \exp(-\|z - x_i\|^2/c)}$   (9)

We note that the denominator equals $(\Phi(z) \cdot P_n \Phi(x))$ (cf. (8)). Making the assumption that $P_n \Phi(x) \neq 0$, we have $(\Phi(x) \cdot P_n \Phi(x)) = (P_n \Phi(x) \cdot P_n \Phi(x)) > 0$. As $k$ is smooth, we conclude that there exists a neighborhood of the extremum of (8) in which the denominator of (9) is $\neq 0$. Thus we can devise an iteration scheme for $z$ by

$z_{t+1} = \frac{\sum_{i=1}^{\ell} \gamma_i \exp(-\|z_t - x_i\|^2/c)\, x_i}{\sum_{i=1}^{\ell} \gamma_i \exp(-\|z_t - x_i\|^2/c)}$   (10)

Numerical instabilities related to $(\Phi(z) \cdot P_n \Phi(x))$ being small can be dealt with by restarting the iteration with a different starting value. Furthermore we note that any fixed point of (10) will be a linear combination of the kernel PCA training data $x_i$. If we regard (10) in the context of clustering we see that it resembles an iteration step for the estimation of the center of a single Gaussian cluster. The weights or 'probabilities' $\gamma_i$ reflect the (anti-)correlation between the amount of $\Phi(x)$ in Eigenvector direction $V^k$ and the contribution of $\Phi(x_i)$ to this Eigenvector. So the 'cluster center' $z$ is drawn towards training patterns with positive $\gamma_i$ and pushed away from those with negative $\gamma_i$, i.e. for a fixed point $z_\infty$ the influence of training patterns with smaller distance to $x$ will tend to be bigger.
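The iteration (10) translates directly into code. The sketch below is our own illustration under the paper's assumptions (Gaussian kernel, coefficients $\gamma_i = \sum_{k=1}^{n} \beta_k \alpha_i^k$ computed beforehand); the restart heuristic for a small denominator follows the remark above, with the perturbation scale and tolerances chosen arbitrarily.

import numpy as np

def gaussian_preimage(X, gamma, c, z0, n_iter=100, tol=1e-8, rng=None):
    """Fixed-point iteration (10) for an approximate pre-image with a Gaussian kernel.

    X:     (l, N) training patterns x_i
    gamma: (l,) coefficients gamma_i = sum_k beta_k alpha_i^k
    z0:    starting value, e.g. the pattern x whose projection is being inverted
    Restarts from a perturbed training pattern if the denominator becomes too small.
    """
    rng = np.random.default_rng() if rng is None else rng
    z = np.asarray(z0, dtype=float)
    for _ in range(n_iter):
        w = gamma * np.exp(-np.sum((X - z) ** 2, axis=1) / c)
        denom = w.sum()                       # equals (Phi(z) . P_n Phi(x)), cf. (9)
        if abs(denom) < 1e-12:                # numerical instability: restart elsewhere
            z = X[rng.integers(len(X))] + 1e-3 * rng.standard_normal(X.shape[1])
            continue
        z_new = (w @ X) / denom               # eq. (10)
        if np.linalg.norm(z_new - z) < tol:
            return z_new
        z = z_new
    return z

For de-noising, a natural starting value is the pattern $x$ itself; the fixed point returned by the iteration is then taken as the approximate pre-image of $P_n \Phi(x)$ in input space.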
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره شماره
صفحات -